>Didn't know you and Eric were working together.
Andrew is my boss ;-)
>Actually, this reminds me of a former client who was on R5. They were a headhunting company and bought this Notes
>software for managing contractors and companies. It had the ungodliest views you've ever seen w/ lots of sortable
>columns, etc. They kept complaining it was slow and wanted to migrate off it. The databases were into the several
>GB range w/ these monster views. Their actual document range was roughly 600K.
>They also had weird issues w/ some documents not being in both replicas and occasional log messages about
>corrupt documents. Their workaround was to copy the database to the other server before continuing replication.
>Never did find out what was causing the occasional corrupt document; had an ugly incident about changing a live
>database w/o having a staging server even though I warned them that mission critical apps really should be tested in
>a separate environment before rollout :-P
Sounds similar to our problems. hehehe...
>Let us know what you find out. It's an intriguing problem at the very least (though I suspect if you've banged your head
>against the table for 3 months, you have another word for "intriguing" :-)
At this point we think it's a combination of 2-4 different issues.
- The socket errors, Domino crashes, and insanely slow client database were most likely due to corrupt agent data documents (99% sure..)
- The current "mega corruption" problem (~10 databases going corrupt every day during heavy usage) seems to be a hardware problem. The corruption seems to happen during write operations when the server is under heavy load: during a compact, or when there are 40-50 users on the server. Could be anything from RAM to Linux software RAID.... We're moving everyone to some hardware we know is stable and renaming the server; hopefully that will clear up the issue and let us be 100% sure it's that one hardware configuration. We're seeing just about every type of corruption you can imagine (I've personally cataloged over 8 different corruption errors: bitmap tables, documents, you name it)
- Back in ye-olde-days (last year) we would get corrupt documents when a bug in the system would cause a $ref hierarchy loop. This usually caused b-tree errors and/or document corruption. Try setting the $ref of a document to its own UNID and see what happens after a while :-)
- We're still getting one or two databases that go corrupt on a regular basis, say, every 3 months. This has been going on for as long as I can remember, and unfortunately we've almost accepted it as a fact of life. This even happened before all of our current troubles. Could be some rogue code we wrote, or could be a nasty document that just won't go away :-)
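
For the $ref loop case above, here is a rough sketch of how you could scan for loops offline. It assumes you've already dumped each document's UNID and its $ref parent into a plain dict (the dict and the function name find_ref_loops are made up for illustration; nothing here calls the Notes API):

```python
def find_ref_loops(parents):
    """Return the set of UNIDs that sit on a $ref loop.

    `parents` is a hypothetical export: it maps each document's UNID to the
    UNID stored in its $ref field, or to None for a top-level document.
    """
    on_loop = set()
    visited = set()  # UNIDs whose parent chain has already been walked

    for start in parents:
        path = []    # UNIDs seen on this walk, in order
        index = {}   # UNID -> position in `path`
        node = start

        # Follow parent links until we fall off the top, reach an
        # already-checked document, or revisit one from this same walk.
        while node is not None and node not in visited and node not in index:
            index[node] = len(path)
            path.append(node)
            node = parents.get(node)  # a missing parent is treated as top-level

        # Revisiting a document from this walk means everything from its
        # first appearance onward forms a loop.
        if node in index:
            on_loop.update(path[index[node]:])

        visited.update(path)

    return on_loop
```

A document whose $ref points at its own UNID shows up as a loop of length one. Documents that merely hang off a loop (responses to a looping document) are not reported, only the members of the loop itself.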
Thanks again for your help, Ken. I'll let you know when things are running smoothly again - at least we can almost imagine having this problem solved now.. <grin>